Evaluating Entity Resolution Results

Authors

  • David Menestrina
  • Steven Euijong Whang
  • Hector Garcia-Molina
Abstract

Entity Resolution (ER) is the process of identifying groups of records that refer to the same real-world entity. Various measures (e.g., pairwise F1, cluster F1) have been used for evaluating ER results. However, ER measures tend to be chosen in an ad hoc fashion, without careful thought as to what defines a good result for the specific application at hand. In this paper, our contributions are twofold. First, we analyze existing ER measures, showing that they can often conflict with each other by ranking the results of ER algorithms differently. Second, we explore a new distance measure for ER (called “generalized merge distance” or GMD) inspired by the edit distance of strings, using cluster splits and merges as its basic operations. A significant advantage of GMD is that the cost functions for splits and merges can be configured, enabling us to clearly understand the characteristics of a defined GMD measure. Surprisingly, a state-of-the-art clustering measure called Variation of Information is a special case of our configurable GMD measure, and the widely used pairwise F1 measure can be directly computed using GMD. We present an efficient linear-time algorithm that correctly computes the GMD measure for a large class of cost functions that satisfy reasonable properties.
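
The idea of a split-and-merge distance with configurable costs can be illustrated with a minimal Python sketch. This is not the authors' algorithm as published; it is a simplified illustration that assumes both partitions cover the same records and that the cost functions give the same total regardless of the order of operations. The names `split_merge_costs`, `pairwise_f1`, `split_cost`, and `merge_cost` are hypothetical. With the product cost x * y for both operations, the split cost counts record pairs that the result links but the reference does not, and the merge cost counts pairs that the reference links but the result misses, which is enough to derive pairwise F1.

```python
from collections import Counter


def split_merge_costs(result, truth, split_cost, merge_cost):
    """Return (split total, merge total) for turning `result` into `truth`.

    Illustrative sketch: every cluster of `result` is first split along the
    boundaries of `truth`, then the pieces are merged into `truth` clusters.
    `split_cost(x, y)` and `merge_cost(x, y)` take the two cluster sizes.
    """
    # Map each record to the index of its cluster in `truth`.
    truth_of = {rec: j for j, cluster in enumerate(truth) for rec in cluster}

    total_split = 0
    pieces = {}  # truth cluster index -> sizes of the pieces that land in it
    for cluster in result:
        overlap = Counter(truth_of[rec] for rec in cluster)
        remaining = len(cluster)
        for j, size in overlap.items():
            remaining -= size
            if remaining > 0:
                # Peel one piece off the rest of this result cluster.
                total_split += split_cost(size, remaining)
            pieces.setdefault(j, []).append(size)

    total_merge = 0
    for sizes in pieces.values():
        merged = sizes[0]
        for size in sizes[1:]:
            # Merge the next piece into the growing truth cluster.
            total_merge += merge_cost(merged, size)
            merged += size
    return total_split, total_merge


def pairwise_f1(result, truth):
    """Pairwise F1 derived from the sketch above using product costs."""
    def product(x, y):
        return x * y

    false_pos, false_neg = split_merge_costs(result, truth, product, product)
    linked_pairs = sum(len(c) * (len(c) - 1) // 2 for c in result)
    true_pos = linked_pairs - false_pos
    if true_pos == 0:
        return 0.0
    return 2 * true_pos / (2 * true_pos + false_pos + false_neg)


if __name__ == "__main__":
    result = [{"a", "b", "c"}, {"d"}, {"e", "f"}]
    truth = [{"a", "b"}, {"c", "d"}, {"e", "f"}]
    print(split_merge_costs(result, truth, lambda x, y: 1, lambda x, y: 1))
    print(pairwise_f1(result, truth))
```

Passing a constant cost of 1 for both functions simply counts the splits and merges, while other cost functions weight operations by the sizes of the clusters involved; the sum of the two returned totals plays the role of the distance.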

Similar resources

The Effect of Transitive Closure on the Calibration of Logistic Regression for Entity Resolution

This paper describes a series of experiments in using logistic regression machine learning as a method for entity resolution. From these experiments the authors concluded that when a supervised ML algorithm is trained to classify a pair of entity references as a linked or not-linked pair, the evaluation of the model’s performance should take into account the transitive closure of its pairwise lin...

Corpus based coreference resolution for Farsi text

"Coreference resolution", or "finding all expressions that refer to the same entity" in a text, is one of the important requirements in natural language processing. Two words are coreferent when both refer to a single entity in the text or the real world. So the main task of coreference resolution systems is to identify terms that refer to a unique entity. A coreference resolution tool could be...

A Practioner's Guide to Evaluating Entity Resolution Results

Entity resolution (ER) is the task of identifying records belonging to the same entity (e.g. individual, group) across one or multiple databases. Ironically, it has multiple names: deduplication and record linkage, among others. In this paper we survey metrics used to evaluate ER results in order to iteratively improve performance and guarantee sufficient quality prior to deployment. Some of th...

Performance Comparison of Apache Spark and Tez for Entity Resolution

Entity Resolution is among the hottest topics in the field of big data. It finds duplicates in datasets, that is, records that actually belong to the same entity in the real world. Algorithms that perform Entity Resolution are computation-intensive and consume a lot of time, especially for large datasets. A lot of research has been conducted on improving Entity Resolution solutions. A number of algorithms are deve...

An Optimal Approach to Local and Global Text Coherence Evaluation Combining Entity-based, Graph-based and Entropy-based Approaches

Text coherence evaluation has become a vital and appealing task in Natural Language Processing subfields such as text summarization, question answering, text generation and machine translation. Existing methods, like entity-based and graph-based models, deal with how nouns and noun phrases change roles in sequential sentences within a short part of a text. They even have limitations in global coheren...

Improvement of Chemical Named Entity Recognition through Sentence-based Random Under-sampling and Classifier Combination

Chemical Named Entity Recognition (NER) is the basic step for subsequent information extraction tasks such as named entity resolution, drug-drug interaction discovery, and extraction of the names of molecules and their properties. Improvement in the performance of such systems may affect the quality of the subsequent tasks. Chemical text from which data for named entity recognition is extracte...

Journal title:
  • PVLDB

Volume 3  Issue 

Pages  -

Publication date 2010